Survival Rate Prediction DM Final Project

Data Processing

Handle Missing Values

For numeric columns, we impute missing values with linear interpolation.
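
A minimal sketch of this imputation step, using a hypothetical toy frame rather than the real dataset (column names and values are illustrative):

```python
import pandas as pd

# Toy frame with gaps like those in the raw numeric columns (hypothetical values)
df = pd.DataFrame({"age": [25.0, None, 40.0], "bmi": [None, 23.5, 24.0]})

# Linearly interpolate each numeric column; backfill any leading NaNs
# that interpolation (which works forward between known values) cannot reach
num_cols = df.select_dtypes("number").columns
df[num_cols] = df[num_cols].interpolate().bfill()
```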

EDA

Since there is serious collinearity among the variables, we decide to use PCA for dimensionality reduction.

Feature Engineering

Encode categorical variables
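
One common way to encode the categorical columns is one-hot encoding; a sketch with a hypothetical slice of the APACHE body-system column (values are made up for illustration):

```python
import pandas as pd

# Hypothetical slice of a categorical column from the dataset
df = pd.DataFrame(
    {"apache_3j_bodysystem": ["Cardiovascular", "Gastrointestinal", "Sepsis"]}
)

# One-hot encode; drop_first removes one redundant dummy per variable
encoded = pd.get_dummies(df, columns=["apache_3j_bodysystem"], drop_first=True)
```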

Drop Unnecessary Columns

PCA

From the PCA variance plot, we find that the first two components explain roughly 85% of the total variance. Thus, we decide to use the first two components (0, 1).
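
A sketch of reading the cumulative explained variance and keeping components 0 and 1, on toy correlated data standing in for the blood-pressure columns (not the real dataset):

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

# Five strongly correlated toy features, mimicking the collinear vitals columns
rng = np.random.default_rng(0)
base = rng.normal(size=(200, 1))
X = np.hstack([base + 0.1 * rng.normal(size=(200, 1)) for _ in range(5)])

X_std = StandardScaler().fit_transform(X)
pca = PCA().fit(X_std)
cum_var = np.cumsum(pca.explained_variance_ratio_)  # cumulative variance explained

# Keep only the first two components once they pass the variance threshold
X_reduced = PCA(n_components=2).fit_transform(X_std)
```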

'd1_sysbp_noninvasive_max', 'd1_sysbp_max', 'h1_mbp_noninvasive_max', 'd1_diasbp_noninvasive_min', 'apache_3j_bodysystem_Gastrointestinal', 'h1_mbp_noninvasive_min', 'h1_mbp_max', 'd1_sysbp_min'

VIF

No correlation value among the remaining variables is greater than 0.7. Thus, there is no serious multicollinearity issue.
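
The VIF check can be sketched with NumPy alone: each variable's VIF is the corresponding diagonal entry of the inverse correlation matrix (toy near-independent data here, not the real predictors):

```python
import numpy as np

# Toy design matrix of three near-independent predictors (hypothetical data)
rng = np.random.default_rng(1)
X = rng.normal(size=(300, 3))

# VIF_i equals the i-th diagonal entry of the inverse correlation matrix;
# values near 1 indicate no multicollinearity, values above ~5-10 are a red flag
corr = np.corrcoef(X, rowvar=False)
vif = np.diag(np.linalg.inv(corr))
```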

Since some collinearity still remains, we decide to add interaction terms.

SMOTE After Splitting Data

The density plot shows that the target variable "hospital_death" has a very uneven distribution: Class 0 has many observations while Class 1 has very few. To deal with the imbalanced data, we will use the SMOTE oversampling technique to generate synthetic samples of the minority class.

Now the data is balanced.

Models

1. Random Forest

Since our test dataset is imbalanced, simply looking at accuracy could be misleading.

Thus, we introduce another metric, the F1-score, to evaluate model performance.
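
A small illustration of why accuracy alone misleads here: a degenerate classifier that always predicts "survived" on a hypothetical 90/10 test set scores high accuracy but a zero F1-score:

```python
from sklearn.metrics import accuracy_score, f1_score

# Degenerate classifier: always predicts the majority class (0 = survived)
y_true = [0] * 90 + [1] * 10
y_pred = [0] * 100

acc = accuracy_score(y_true, y_pred)  # looks fine, but only reflects the imbalance
f1 = f1_score(y_true, y_pred)         # zero: the model never identifies a death
```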

The CV accuracy and training accuracy are roughly equal, and the CV accuracy is not smaller than our desired accuracy of 85%. Thus, our model is neither overfitting nor underfitting.
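
This overfit/underfit check can be sketched as follows, using an easy synthetic task in place of the processed training set (the 85% target is the project's own threshold):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

# Easy synthetic task standing in for the processed training data
X, y = make_classification(n_samples=500, class_sep=2.0, random_state=0)

train_acc = RandomForestClassifier(random_state=0).fit(X, y).score(X, y)
cv_acc = cross_val_score(RandomForestClassifier(random_state=0), X, y, cv=5).mean()

# Overfit signal: train_acc much larger than cv_acc
# Underfit signal: cv_acc below the 85% target
```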

2. Decision Tree

The CV accuracy and training accuracy are roughly equal, and the CV accuracy is not smaller than our desired accuracy of 85%. Thus, our model is neither overfitting nor underfitting.

3. Logistic Regression

The CV accuracy and training accuracy are roughly equal, but the CV accuracy is much smaller than our desired accuracy of 85%. Thus, our model has an underfitting issue.

4. K-NN

Summary of model performance

By comparing the results from Random Forest, Decision Tree, Logistic Regression, and K-Nearest Neighbors, we conclude that Random Forest delivered the best performance among all our models. Therefore, we select Random Forest as our final model for prediction.